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A METHOD FOR GENE MAPPING FRQ MGENOTYPF, AND PHENQTYPE 
DATA. 

-r 1 

Field of the invention 

5 The present invention relates to a method for gene mapping from genotype and 
phenotype data, which method utilizes linkage disequilibrium between genetic 
markers m;, which are polymorphic nucleic acid or protein sequences or strings of 

single-nucleotide polymorphisms deriving from a chromosomal region. 

Background of the invention 

10 The use of linkage disequilibrium (LD) in detecting disease genes has recently 
drawn much attention in genetic epidemiology. LD is evaluated with association 
analysis, which, when applied to disease-gene mapping, requires the comparison of 
allele or haplotype frequencies between the affected and the control individuals, 
under the assumption that a reasonable proportion of disease-associated chromo- 

15 somes has been derived from a common ancestor. Traditional association analysis 
methods have long been used to test the involvement of candidate genes in diseases 
and, in special circumstances, to fine-map disease loci found by linkage methods. 
The testing has mostly been done using simple two-point measures. 

Improved statistical methods to detect LD have been presented lately (Terwilliger 
20 1995; Devlin et al. 1996; Lazzeroni 1998; McPeek and Strahs 1999; Service et al. 
1999). The newer methods are based on statistical models of LD around a disease 
susceptibility (DS) gene. Genomic regions - rather than alleles - that are shared 
among affected individuals, are searched for. The recombination history from the 
common ancestor to the present day is taken into account with more or less simpli- 
25 fied statistical models. The power of these methods, as well as their ability to local- 
ize the correct position of the DS gene, has been shown to be better than that of tra- 
ditional methods. Some of the models are robust against high levels of etiologic 
heterogeneity (McPeek and Strahs 1999; Service et al. 1999). However, the meth- 
ods contain assumptions about the inheritance model of the disease and the structure 
30 of the survey population, and the effects of violations of these assumptions in the 
real data are not known. In addition, they can only consider association of one re- 
gion at a time. Thus, they are currently best suited for fine mapping rather than 
complex disease mapping or genome screening. The methods also tend to be com- 
putationally heavy. 
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The present inventors have recently introdneed a so-called haplotype patten, tmmng 
tfPM) method (Toivonen e, al. 2000a and 2000b). In the HPM method, haplotype 
pauems a,e ordered by their strength of association with *e Phenotype and aU hap- 
Lype patterns exeeeding a given threshold level are used for predion of drsease 
5 usability gene loeation. The advantage of the HPM method ,s that u ts mode - 
free lit does not require any assumptions about the inhentance mode, of *e te- 
ease. The haplotype patterns are allowed to contain gaps and therefore : the HPM 
method is quite robust against mutations and to missing and -oneous daa. How- 
ever the basis of the HPM method is that haplotypes, ..e. separate vectors of alleles 
,0 of markers, are available. As wil. be explained below, mis requiremen r causes van- 
ous problems in gene mapping methods, and thus also m me HPM method. 
Zhang et al. (2002) have extended the HPM method to allow simultaneous use of 
haplotype data of related individuals with quantitative trait from an extended pedt- 
Z £b is done by employing the Quantitative Pedigree Dtsequ. hbnum Tes 
15 «PDT) statistic to measure the strength of association between haplotype and a 
quantitative trait. 

The standard procedure in association-based gene mapping is to 1) ascertain indi- 
viduate carrying the trait of interest and their family members .least paremsV2) 
senotype the individuals, 3) derive the haplotypes computattonally ustng genotypes 
20 within families, and Anally to 4) find associations in the haplotypes (gene mappmg). 

Even though me actual association analysis is done on sole case and I control haplo- 
types, obtaining these haplotypes requires the parents of the affected tndmduate £ 
genotyped as well: vast majority of haplo.yping programs avadable expect the 
ItZ genotypes to exist. This means that the parents first have to be reerurfcd, 

25 wh ctis not always straightforward, as they might no longer be ahve or canned be 
reached, or refuse from giving blood samples. Genotyping more md.v.duals . labo- 
rious and elevates the study expenses: per every case or control, 3 mdtvduals wrll 
L genotyped instead of just one, so genotyping is done on 3 times as many persons 
* " cases and controls. In case the non-transmitted parental chromosome 

30 couTd be used as controls, a case and his/her parents contributes one case-control 
p^ in which case the genotyping effort is 1 .5 times higher man me number of ca- 
ses and controls needed. 

As an alternative to these haplotyping approaches, some methods for direct haplo- 
Wtaglrom population-based data have indeed been presented, bu, the problem 
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with these is that they still produce a lot of mistakes, which is a very bad starting 
point for any haplotype based association program. 

There is no straightforward way to use genotypes as input for a method that is de- 
signed for haplotypes. Given a genotype, it is in principle possible to consider all 

5 haplotype configurations available from the genotype, and to run a haplotype gene 
mapping method on the different configurations of chromosomes. In practice, how- 
ever this is not feasible for marker maps of reasonable size due to a combinatorial 
explosion: given a genotype with N heterozygous markers, the number of different 
possible haplotype configurations is 2 N "' (or 1, if N = 0). F« instance, for N=100 

10 the number of possible haplotype configurations is about 6* 10 . 

Zhang and Zhao (2002) have studied linkage disequilibrium mapping directly with 
genotype data. Their approach is model-based and the method is based on the decay 
of haplotype sharing (DHS) method for haplotype data developed by McPeek and 
Strahs (1999) The approach of Zhang and Zhao is based on explicitly considenng 
15 all possible haplotype configurations. Since this is not feasible for marker maps of 
interesting sizes - as was described above - they apply complex and error-prone 
techniques to prune the number of haplotype configurations they need to consider. 
Further in this method, data consisting of multiallele (microsatellite) loci only - 
thus no SNPs (single nucleotide polymorphisms) or any other type of markers - is 
20 considered. In short, the method works as follows: it is supposed that there are two 
alleles in the disease locus: the disease causing allele D and the normal allele d. The 
basic idea is to treat the chromosomes in affected individuals as if they were a ran- 
dom sample from chromosome population consisting of both chromosomes with the 
D allele and d allele. Chromosomes in normal individuals are assumed to be a ran- 
25 dom sample of chromosome population only consisting of d chromosomes. Next a 
likelihood for individual haplotypes is formulated in the same way as presented by 
McPeek and Strahs (1999) where the probability of an observed haplotype is de- 
pendent of number of generations from original disease mutation, recombination 
rates between markers, and mutation rate in marker loci. The gap between using 
30 haplotype data as the starting point (as in McPeek and Strahs) and genotype data (as 
Zhang and Zhao present) as the starting point is bridged with following inference: 
for each genotype gl , there are several haplotype pairs that are compatible with it 
(2 N -' where N is the number of heterozygote sites in the genotype). The likelihood 
for an observed genotype is the sum of probabilities of each possible haplotype pair, 
35 where the probabilities of individual haplotypes are formulated as above. The ge- 
netic parameters of interest (such as location of the disease locus, mutation rate and 
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disease allele age) are then estimated by using EM algorithm. The large number of 
possible ancestral haplotypes prerequisites pruning out any too rare haplotypes; the 
haplotype frequencies are estimated with Markov model, and any which are below 
some prespecified level are left unconsidered. 

5 The approach of Zhang and Zhao has the following serious drawbacks. First, the 
principle of Zhang and Zhao is to explicitly consider all possible haplotype configu- 
rations. This is feasible only with very small marker maps. Second, to avoid the first 
problem and to extend the applicability of the approach to larger maps, Zhang and 
Zhao apply additional pruning techniques to reduce the number of haplotype con- 

10 figurations they need to consider. However, those techniques are complex and error- 
prone. Third, their approach is based on summing probabilities of different haplo- 
type configurations. Such an approach is not directly applicable to pattern-based 
mapping methods such as HPM. 

Curtis et al. (2001) studied the use of an artificial neural network to detect associa- 
15 tion between disease and multiple marker genotypes. The pattern-recognition prop- 
erties of the network were used in the hope that marker haplotypes implicit in the 
genotypes differed between cases and controls in a way which led to the network 
being able to classify the subjects correctly, according to their marker genotype. 

Summary of the invention 

20 The object of the present invention is to provide a model-free and computationally 
effective method allowing direct association analysis on genotype rather than haplo- 
type data which overcomes the above-mentioned drawbacks. The invention offers 
remarkable advantages by avoiding the technically difficult, costly and sometimes 
impossible steps of recruiting and genotyping family members, as well as by avoid- 

25 ing some of the error sources present in population-based haplotyping methods. 

The above-mentioned object is achieved in accordance with the invention by the 
method for gene mapping from genotype and phenotype data, which utilizes linkage 
disequilibrium between genetic markers m U which are polymorphic nucleic acid or 
protein sequences or strings of single-nucleotide polymorphisms deriving from a 
30 chromosomal region. The method according to the invention is characterized by the 
following steps: 

i) all marker patterns P that satisfy a pattern evaluation function e(P) are 
searched from the data, wherein 
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a. the marker patterns are expressions involving the marker-allele as- 
signments and zero or more of the following: individual covariates, 
environmental variables and auxiliary phenotypes; and 

b. the pattern evaluation function e(P) involves some statistical measure 
5 of the association between the marker pattern P and the phenotype be- 
ing studied, 

by testing each marker of pattern P against the corresponding allele pan- 
in genotype G, effectively finding out if there is a possible haplotype 
configuration of G which matches P and counting the possible matches 
10 as matches, 

ii) each marker m; of the data is scored by a marker score s( mrf, which is a 
function of the set 5, defined as the set of marker patterns overlapping 
the marker mj and satisfying the pattern evaluation function e as de- 
fined in step (i), and 

iii) the location of the gene is predicted as a function of the scores s(m0 of 
all the markers m, in the data and is based on maximizing the score if 
the scoring function is designed to give higher scores closer to the gene, 
and on minimizing the score if the scoring function is designed to give 
lower scores closer to the gene, as is the case for instance when the 
scores s(m0 are marker-wise p values. A computer-readable data stor- 
age medium according to the invention has computer-executable pro- 
gram code stored thereon, said executable program code being opera- 
tive to perform a method of any embodiments of the invention when 
executed on a computer. 

25 A computer system according to the invention is programmed to perform the 
method of any embodiments of the invention. 

As used herein the term 'haplotype' defines a vector of alleles in a single chromo- 
some. Also, as used herein the term 'genotype' defines a vector of (unphased) allele 
pairs in a chromosome pair. 

30 The term , microsatellite' used defines a small run (usually less than 0.1 kb) of tan- 
dem repeats of a very simple DNA sequence, usually 1-4 bp, for example (CA)n. It 
has been used as the primary tool for genetic mapping during the 1990s. 'Multialle- 
lic genetic locus' is a gene with high level of variation; there are several types of 
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variants in the gene locus, each with reasonably high frequency. 'SNP\ single nu- 
cleotide polymorphism, defines any polymorphic variation at a single nucleotide. 
Although less informative than microsatellites, SNPs are more amenable to large- 
scale automated scoring. 

5 Brief description of the drawings 

Figure 1 shows the localization accuracy of HPM-G compared to HPM: the y axis 
shows which fraction of simulated data sets is in the predicted region, the length of 
which is given on the x axis. 

Figure 2 shows the effect of sample size on localization accuracy with a) HPM-G 
10 (sample size in people) and b) HPM (sample size in chromosomes). 

Figure 3 shows the effect of missing data (5%, 10%) on localization accuracy with 
a) HPM-G (150 affected and 150 control individuals) and b) HPM (200 disease- 
associated and 200 control chromosomes). 

Figure 4 shows the effect of 100 permutations on localization accuracy. 
1 5 Detailed description of the invention 

The object of the present invention is to provide a method for gene mapping from 
genotype and phenotype data, which utilizes linkage disequilibrium between genetic 
markers m* which are polymorphic nucleic acid or protein sequences or strings of 
single-nucleotide polymorphisms deriving from a chromosomal region. The chro- 
mosome data may consist of genotypes or haplotypes. The phenotype being studied 
may also be a combination of several phenotypes. 

The method according to the invention, also named as haplotype pattern mining in 
genotype data (HPM-G), uses data mining methods in LD-based gene mapping. The 
method uses both genotypes and haplotypes as input. In diseases with reasonable 
genetic contribution, affected individuals are likely to have higher frequencies of 
associated marker alleles near the DS gene than control individuals. Combinations 
of marker alleles which are more frequent in genotypes of affected individuals than 
in genotypes of unaffected individuals, are searched for in the data, without assump- 
tions about the mode of inheritance of the disease. These combinations, marker pat- 
terns or haplotype patterns, are sorted by the strength of their association with the 
disease and the resulting list of marker or haplotype patterns is used in localizing 
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the DS gene. Terms marker pattern and haplotype pattern denote the same concept, 
and are used interchangeably in this text. 

The method according to the present invention is an algorithm-based extension of 
traditional association analysis. It works with a non-parametric statistical model and 

5 without any genetic models. The localization power of the method of the invention 
is high even in cases, where multiple independent founder mutations are allowed, 
and the frequency of the most common mutation in the affected chromosomes var- 
ies between 5-15% at realistic sample sizes (100 affected individuals and a similar 
number of population controls). In addition, the experiments suggest that the 

10 method is highly robust against missing data. Since HPM-G can handle high de- 
grees of etiologic heterogeneity, it can be successful in complex disease mapping. 

LD the non-random association of marker alleles and haplotypes to the disease, is 
likely to be strongest around the DS gene; consequently the locus is likely to be 
where most of the strongest associations are. In the HPM-G method according to 
15 the invention, we search for shared, flexible haplotypes that may contain gaps, and 
find out, which ones are strongly associated with the disease status. We then use a 
non-parametric model for predicting the DS locus, on the basis of the locations of 
the haplotypes. Permutation tests can be used to contrast the results against the null 
hypothesis that there is no gene effect. 

20 Marker or Haplotype Patterns and Disease Association 

We examine linkage disequilibrium by looking for marker or haplotype patterns that 
consist of a set of nearby markers, not necessarily consecutive ones. Given a marker 
map M with k markers m h ...,m h a "marker pattern" or "haplotype pattern P on M 
is defined as a vector (p h ..., Pk \ where each Pi is either an allele of marker m; or the 
25 "don't care" symbol (*). The haplotype pattern P occurs in a given haplotype vector 
(ctomosome)^;,..^) Upr*i « Pi= * for all U<=*<=*- Pattern P occurs in a 
given genotype G=({gn, g 12 ), Uk.^}) if Pi=8il or P i=g i2 or Pi =* for all 
i,\<=i<=k. 

For example, consider a marker map of 10 markers. The vector Pi = (*, 2, 5, *, 3, *, 
30 *, *, *, *), where 1, 2, 3,... are marker alleles, is an example of a haplotype pattern. 
This pattern occurs, for instance, in a chromosome with haplotype (4, 2, 5, 1, 3, 2, 6, 
4 5 3) The pattern also occurs in the genotype ({2,5}, {2,3}, {1,5}, {4,6}, {3,6}, 

{2,4}, {1,2}, {1, 4}, {3,5}, {1, 6}). (For instance, {2,5} is the genotype of marker 

1; the alleles are 2 and 5, but their phases are not known.) 
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Our goal is to search for haplotype patterns that roughly correspond to haplotypes 
identical by descent in the disease-associated. In doing this, there are two major is- 
sues with respect to the shapes of haplotype patterns: the genetic length of the sig- 
nificant part of the patterns, and gaps. We define the "(genetic) length" of a haplo- 

5 type pattern P=(p h ..,Pk) as the maximum distance, in Morgans, between any two 
markers m it mj with Pi * * * pj. Searching for haplotype patterns of arbitrary length 
hardly makes sense; it is unlikely that genetically extremely long patterns will be 
discovered, at least not in significant numbers. Consequently, when haplotype pat- 
terns are searched for, the maximum length of patterns to be considered can be con- 

10 strained with an optional pattern-search parameter to the HPM-G method. 

We allow for gaps in the marker patterns, since mutations, gene conversions, errors, 
missing data, and recombinations can corrupt continuous haplotypes. Marker muta- 
tions and errors typically cause very short gaps only. Missing information can span 
several consecutive markers, depending on the data collection scheme. Longer gaps 
15 can be introduced by double recombinations which, however, are rare on genetically 
short distances. In the HPM-G method, the maximum number and maximum length 
of gaps can be controlled with pattern search parameters. 

Mining Disease-Associated Haplotype Patterns 

All marker patterns P that satisfy a pattern evaluation function e(P) are searched 
20 from the data in the step (i) of the method of the invention. The pattern evaluation 
function e(P) involves some statistical measure of the association between the 
marker pattern P and the phenotype being studied. In step (ii), each marker m; of the 
data is scored by a marker score s(m0, which is a function of the set defined as 
the set of marker patterns overlapping the marker and satisfying the pattern 
25 evaluation function e as defined in step (i). 

In step (i), let U be the universe of marker patterns considered in the study. The 
output S of this phase is the set of those marker patterns that satisfy e, i.e., S={P€ 
U\e(P) is true}. 

In step (ii), for each marker m t in the data, let S t = {P e U \ P makes a reference to 
30 mi or to a marker to the left of and to a marker to the right ofm t } be the set of pat- 
terns in 5 that overlap with the marker m/. In this phase each marker m t is scored as 
a function of Sj f and the result is s(m0. 

In step (iii), the location of the gene is predicted as a function of the scores s(m0 of 
all the markers m t in the data. This function returns an area where the gene is likely 
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case. 



The marker or haplotype patterns P can be searched for by an algonthm developed 
by the inventors for this purpose or by the levelwise search method described in 
Mannila and Toivonen (1997). Preferred algorithms are given in the following. 

Version, 1 of fr « al portthm f™ marker pattern searchin] 

The following algorithm is a simple, generic, and efficient way to implement step 
(i) of the method according to the invention. It is based on depth-first search in die 
space of patterns, a standard procedure in computer science. A pre-requisite is that 
there is a suitable generalization relation for the patterns, such that if a pattern satis- 
fies the evaluation function, then all more general patterns also satisfy it. 



Input 

• set U of possible marker patterns 

• evaluation function e(P) for patterns P in U 
15 • (generalization) relation < for patterns in U 

. where the function e and the relation < are such that if e(P) is true and P < P, 

then e(P r ) is also true 
Output 

• set S = {P G U | e{P) is true) of patterns 

20 Method 

1. S: = {) 

2. // Initialize the set of evaluated patterns: 

3. £:={} 

4. // Start with the most general patterns: 

25 5. Gen:= {P 'mU\ there is no P' in U, P' != P, such that P'<P) 

6. // Recursively evaluate patterns in a depth first order: 

7. foreach P e Gen { evaluatePatterns(P) } 

8. end; 

9. procedure evaluatePatterns^) { 
30 10. insert P into the set E 

11. if e(P) = true then { 

12. insert P into set S 

13. // Find all specializations of P that have not been tested yet, and 

14. // evaluate them recursively: 
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15. Spec := {P'in U-E \ P < P\ P' != P. and there is no P" in U-E, P"\=P 

16. and P" != P', with P<P"< P'}; 

17. foreach P' in Spec { evaluatePatterns(P')'. } 

18. } 
5 19.} 

Version 2 of the algorithm for m arker pattern searching 

The following algorithm is a simple, generic, and efficient way to implement step 
(i) of the method according to the invention. It is based on depth-first search in the 
space of patterns, a standard procedure in computer science. It approximates the 
10 exact answer by ignoring infrequent and therefore statistically less important pat- 
terns. 

Define an auxiliary evaluation function ae(P) which is true if and only if the fre- 
quency of pattern P exceeds a given threshold x, (how to specify the threshold is 
discussed elsewhere) and replace the original evaluation function e(P) by function 
15 e'(P) defined as e'(P) = true if and only ife(P) is true and ae(P) is true. This re- 
finement prunes patterns that satisfy e but are no more frequent than x. Such infre- 
quent patterns are statistically not relevant, and therefore little information is lost 
when they are ignored. Now a suitable generalization relation is obtained from logi- 
cal implication based on the pattern syntax: P < P' if and only if P' -> P. 

20 The algorithm uses the generalization relation based on logical implication to struc- 
ture the search space, and the auxiliary function ae to prune the search space. All 
patterns satisfying ae are searched for, but only those also satisfying e are output. 

Input 

• set U of possible marker patterns 

25 • evaluation function e(P) for patterns P in U 

• frequency threshold x 

Output 

• set 5 = {P in U \ e(P) and ae(P) is true} of patterns, where ae(P) is true if and 
only if the frequency of pattern P exceeds a given threshold x 

30 Method 
20.5: = {} 

21. // Initialize the set of evaluated patterns: 

22. £:= {} 
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23. // Start with the most general patterns: 

24. Gen :={P in U\ there is no P' in U, P' != P, such that P ~> P' ) 

25. // Recursively evaluate patterns in a depth-first order: 

26. foreach P in Gen { evaluatePatterns(P) } 

5 27. end 

28. procedure evaluatePatterns(P) { 

29 . insert P into the set E 

30. if ae(P) = true then { 
31 if e(P) - true then insert P into set S 

10 32. // Find all specializations of P that have not been tested yet, and evaluate 

33. // them recursively: 

34. Spec := { P' in U-E \ P' -> P, P' '•= P. and there is no P" in U-E, P" !- P 
35 and P" != P\ with P' -> P" and P" -> P } 
36. foreach P' in Spec { evaluatePatterns(P') } 

15 37. } 
38.) 

Version 3 of the al gorithm fr>r marker p attern searchin] 

When phenotype being studied is qualitative and the pattern evaluation function 
e(P) has the form e(P) = true if and only ife'(P) > x, where e'(P) is the (signed) as- 

20 sociation measure % 2 and x is a user-specified minimum value, which is chosen so 
that the sizes of S f are large enough, such as 7, to give statistically sufficiently reli- 
able estimates for the gene locus, the following algorithm is a simple, genenc, and 
efficient way to implement step (i) of the method according to the invention. It is 
based on depth-first search in the syntactic space of patterns. It derives a lower 

25 bound lb for pattern frequency from the given lower bound x for chi-squared test, 
and uses lb to prune the search. 

Input 

• marker map M = (mj, ... ,m0 

• phenotype vector Y = (Yj, Y n ) 

30 • genotype matrix H of size n * k * 2 (n persons, k markers, 2 alleles per person 

and marker) 

• association threshold x for chi-squared test 

• maximum pattern length / 

• maximum number of gaps g 
35 • maximum gap size s 
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Output 

• setS= {P in U \ e(P) is true) of patterns, 

. where U consists of patterns on M that consist of marker-allele assignments and 
that adhere to parameters I, g, and i, and 
5 . where e(P) is true if and only if chi-squared test on P using genotype matrix H 
and phenotypes Y exceeds the given threshold x 

Method 
39.S: = {} 

40. // Number of case and control persons: 
10 4 1 . pi A '•= number of affected persons; 

42. pic '= number of control persons; 

43. pt :=piA+piC 

44. // A lower bound for pattern frequency: 
45. lb :=pi A *pi*x/ (pic * + P l A * *) 

15 46 // Variable for iterating over different patterns: 

41. P = (P1, .-.Pic)- ■=('*'. ■•■•*•*) 

48. for i:= i tofc { . 

49. // alleles(mj) is the set of alleles of the i:th marker 

50. foreach a in alleles(mj) { 

20 51. Pi •'= a 

52. // Test pattern P and all its extensions: 

53. checkPatterns(P, i, i, 0, 0) 

54. // Reset pf 

55. Pi ••= '*' 

25 56. } 

57. } 

58. end ,. D 

59. // Test haplotype pattern P and all patterns that can be generated by extending P 

60. // from the right: 

30 61 . procedure checkPatterns(P, start, i, nr_of_gaps, gapjength) { 

62. // Output strongly associated patterns 

63. if chi-squared(P, M, H,Y)>=x and Pi != '*' then insert P into set 5 

64. // Return if extended patterns would be too long: 
65 if i = k or i+1 -start > I then return 

35 66. // Return if extended patterns can not be strongly disease-associated: 

67. if frequency of P in disease-associated persons is less than lb 

68. then return; 
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69. // Create and test legal extensions of current pattern P (3 cases): 

70. // 1 • Give marker i+ 1 all possible values : 

71. foreach a in alleles(m I+ i) { 

72. Pi+1 -= a 

5 73. checkPatterns (P, start, nr_of_gaps, 0) 

74. } 

75 . 112. Introduce a new gap starting at marker i+ 1 : 

76. if Pi * '*' and nr_of_gaps < g and 5 > 1 then { 

77. Pi+i := '*' 

10 78. checkPatterns (P start, i+A nr_of_gaps+l, 1) 

79. } 

80. // 3 • Extend the current gap over marker i + 1 : 

81. if Pi = '*' and gapjength < s then { 

82. Pi+7 •='*' 

15 83. checkPatterns (P, sfar*. i+i, nr_of_gaps, gapjength* J) 

84. } 

85 . // Before returning, reset p j+ ] : 

86. P/+2 := '*' 

87. return 
20 88. } 

Version 4 of the al porith r n for marker pattern searching 

The following algorithm is a simple, generic, and efficient way to implement step 
0) ofTe method according to the invention. It is based on the levelwrse search 
method described in Mannila and Toivonen (1997). 

25 Input 

• set U of possible marker patterns 

• evaluation function e(P) for patterns P in U 

. (generalization) relation < for patterns in U. where the function e and the rela- 
tion < are such that if «(P) is true and P' < P, then e(P') is also true 



30 Output 

• set S = {P in U \ e(P) is true) of patterns 



Definitions 
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. function L g8 :U->2»,L g8 (P)={P' in U \P > P' andP' \=P and there is no 
P» in U such thatP !=P" \= P' and P > P" > P')> the set of least general gen- 

eralizations of pattern P. 
. function Lss: U ->2», Lss(F)= { P' r*V \P <P' and P' \= P and *ere ,s no 

5 F-inU such that P !=/•"!= P' andP <P" <P%** of least spec.al spe- 
cializations of pattern P. 

Method 
89.5 : = {} 
90.e: = U 

10 91 • // Start with the most general patterns: 

92. F : = {P in | *«<?re w no P' in U, P' != F, that P' < P); 

93. while F\= {} { 

94. // Evaluate the candidate patterns: 

95. foreachPinF I 

15 96. if c(F) = true then insert P into set S 

97. else remove P from set F 

98. } 

99. Q ' -Q union F 

100. // Generate a new set of candidate patterns: 

20 101. C: = {) 

102. foreachPinF { 

103 C : = C union { P' in t/ 1 P' in Lss(P) and/or a// P in Lgg(P): 

104*. 

105. } 
25 106. F : = C 

107. } 

108. end 

^c^n s nf the algorithm for marker nattem searchinj 
This is the levelwise search version of the algorithm 2. 

30 Input 

• set U of possible marker patterns 

• evaluation function e(P) for patterns P in U 

• frequency threshold x 
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HeTs = {P in V | eiP) and «*(/>) is ,me) of patterns, where oe(P) is true if and 
only if Che frequency of pattern P exceeds a given threshold x 

Definitions /> , ,_ p ^ theK „ n0 

5 . functron I» B -> 2 : . I« CP • ( ? » « ^ , 
P" in (7 suc/i rfcar P != P '•= P ana r > r h 
generalizations of pattern P. , . 

. function Lss- U -> 2», Us(P) = { P' inU | P' -> P and P' != P ami iter, u 
• funCtl ° n " . . # „ . p» p« and P'->P"-> P), the set of least special 
P" in U such that P != P i- r ana r * r f , 

10 specializations of pattern P . 

Method 

109. S:={) 

110. Q:= {} 

111 // Start with the most general patterns: 
15 112'. F := (P in U \ there is no P' in U, P' P. *«* P "> P ' > ; 

113. while F!= (} { 

1 14. // Evaluate the candidate patterns: 

115. foreachPinF { 

1 16. if aeiP) = true then { 

20 1 17 . if e ( p ) = true men insert P int ° S6t 5 

118. } 

119. else remove P from set F 

120. } 

121. Q: = QurdonF 

25 122. // Generate a new set of candidate patterns: 

123. C: = U 

124. foreachPinF { „„„. , , D '\. 
125 C : = C union { P' in U \ P' in Lss(P)andfor all P wLgg{P). 

126*. P"'^Q^ 
30 127. } 

128. F : = C 

129. } 

130. end 

The phenotype being studied may be qualitative, for example disease is present or 
35 " absent. In mat case, the pattern eva.»a,ion function e(P) may have the 

S« = true if and only if > * where .W is «e (signed) assoctauon 



PCT/FI03/00248 

WO 03/085585 

16 

measure X 2 and x is a user-specified minimum value, which is chosen so that the 
sizes of S.are large enough, such as 7, to give statistically sufficiently rehaWe esti- 
mates for the gene locus and the score s(m0 of marker m is the size of S h also 
called marker-wise pattern frequency of mj and denoted by f[m0. 

5 As mentioned above, the (signed) X 2 is a measure of marker-disease association. A 
signed version of the measure is used in order to discriminate disease association 
from control association. The signed X 2 measure ±£(P) of a haplotype pattern P is 
positive if P is more frequent in cases than in controls, and negative otherwise. 
Given a "(positive) association threshold" x, we say that P is "strongly associated 
10 with the disease if ±X (TO*. 

The first part of the HPM-G method can be described as follows. Given the data - 
markers M, genotypes H, and phenotypes Y - the task is to output all haplotype 
patterns P that are strongly associated with the disease status for a given value of 
L association threshold x. We denote the collection of all such haplotype patterns 
15 by S - that is, 5 = {P is a haplotype pattern on M | ± X 2 (P)>x}. If pattern parame- 
ters are specified - a maximum genetic length, a maximum number of gaps, or a 
maximum length for gaps - the task is refined by requiring that these additional 
restrictions are also fulfilled. 

The signed % 2 value is calculated from a 2x2 contingency table, where the rows cor- 
20 respond to the trait-association statuses of the persons and the columns correspond 
to the presence and absence of the haplotype pattern. A pattern P=(PL-,Pk) ^ pre 
sent in a given genotype G=({g,„ g. 2 ), Uki^}) XPi=8il or Pi =gi2 or pj- tor 
all i K=i<=*. If, instead of a genotype, two haplotype vectors H,=(hll,-Mk) and 
HMh21,~*2k) are given for a person, pattern P is considered to be present in the 
25 person if it is present in either of the haplotypes, i.e., if either Pi =h n or Pi - for all 
i,\<=i<=k, or P i=h2i orpi=*forall i,K=i<=k. 

The value of % 2 statistic is computed normally, and a negative sign is attached, if the 
relative frequency of the haplotype pattern among the control persons is higher than 
among the trait-associated persons. 

30 The first observation in solving the pattern-mining task is that given an association 
threshold x, a lower bound can be derived for the frequency of strongly associated 
haplotype patterns as follows: 
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Given a 2x2 contingency table of the numbers of disease-associated (A) and control 
(Q persons either matching a pattern (P) or not (N), the * 2 test statistic for the dis- 
ease association of the pattern is defined by 

(ft*„ -ft rN -n^ n CP f n 

— — — - — — j 
n A ' n C ' n P ' n N 

5 where tc^ is the number of persons with properties i and j, %i the number of persons 
with property i, and n the total number of persons. Given the number of affected 
persons (n A ), the number of control persons (tc C ), and a lower bound x for the test 
statistic we can derive a lower bound for the pattern frequency among the affected 
persons (n AP ) as follows. Assuming the pattern is disease-associated, we have 

10 KAP - *CN > % AN ' n CP- The test statistic is maximized when k C P = 0, implying 
n AP ~ n P anc * n CN ~ n C- Then 

n A -n c -n P -n N n A -n c -n AP \n-7t P ) n A \n-n AP ) 

and 

IT -Tt -IT ft a '71' X 

n AP n cJL-> x ^ nAp > * 



n A '{ft-ft AP ) 7t c -fi+ft A 'X 

15 The situation is symmetric for protective haplotypes, and the lower bound for n C P 
is obtained by simply swapping it A and % C in the above result. If disease-associated 
and protective haplotypes are searched for at the same time, the smaller of n A P and 
n C p can be used as a lower bound for n P , making the implementation slightly sim- 
pler. 

20 On another hand, given such a frequency threshold, all patterns exceeding the 
threshold can be enumerated efficiently with data-mining algorithms or a standard 
depth-first search method. An algorithm that first finds all haplotype patterns whose 
frequency exceeds the computed lower bound and then evaluates the association 
measure on them, is guaranteed to find the exact set of strongly disease-associated 

25 patterns. 

The approach is suitable for finding protective haplotype patterns by considering 
patterns P with ±X?(P) < The derivation of the lower bound for the frequency 
among controls is identical to the case above. Obviously, both disease-associated 
and protective haplotypes can be found when |±X 2 (^)| ^ *• 
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In another embodiment according to the invention, the phenotype being studied may 
be, in addition to qualitative, also quantitative, for example a measured blood con- 
centration of a substance has a certain value. In that case, the pattern evaluation 
function e(P) may have the form e(P) = true if and only ife'(P) > x, where e'(P) is 
5 the absolute frequency of pattern P in the data and x is a user-specified value, which 
is chosen so that the sizes of 5, are large enough, such as 20, to give statistically suf- 
ficiently reliable estimates for the gene locus. In this embodiment, the statistical 
strength of the method may be still increased. 

A linear model is of form Y = faX x + . . . + &X k + aZ + ft, where the dependent 
10 variable Y is a quantitative phenotype, X x through X k are covariates, such as envi- 
ronmental factors, and Z is a dummy variable for the occurrence of the haplotype 
pattern. Firstly, the coefficients a and 0* are adjusted for best fit. Secondly, the sig- 
nificance of Z as a covariate is assessed by using a t test. If the phenotype is di- 
chotomous, then the logit transformation can be applied. 

15 Marker scoring in the case of qualitative phenotype being studied 

Haplotype patterns close to the DS locus are likely to have stronger association than 
haplotypes further away; consequently the locus is likely to be where most of the 
strongest associations are. In one embodiment according to the invention, the 
marker score s(m0 is defined as the marker frequency f(m0 of marker (with re- 

20 spect to M,H,Y,x) as the number of patterns that contain marker m;, possibly mi in a 
gap: /(mj) = \{P =(pi,-,P0 ^ S | there exist t < i and u > i such thatp^ * *Pu)\- 
The idea is that each haplotype pattern roughly corresponds to a continuous chro- 
mosomal region, potentially identical by descent, where gaps allow for corruption 
of marker data. While markers within gaps are not used in measuring the disease 

25 association of the pattern, the whole chromosomal region of the pattern is thought to 
be relevant. 

Marker scoring in the case of qualitative or quantitative phenotype being studied 

In order to derive the score s(mO, the p value (statistical significance) of each 

marker pattern P in determining the phenotype being studied is evaluated, and the 
30 score s(mO is the distance between the observed p value distribution of patterns in 5, 

and the uniform distribution, defined as average of (p, - qd log (p, / qd over all i = 
l..n, where n is the number of haplotype patterns in 5/, p t is the ith smallest p value 

in S h and q t is the expectation of the ith smallest p value, if the p values were ran- 
domly drawn from the uniform distribution. 
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Gene localization 

The location of the gene, predicted as a function of the scores s( mi ) and based on 
maximizing or minimizing the score, is predicted to 

- the location of the marker m; that maximizes or minimizes the marker score s(m0, 
5 or 

- the combination of most probable intervals for containing the trait-susceptibility 
locus that covers at most the desired proportion t (K= {0,100%}) of the original re- 
gion obtained by taking all such points in the studied chromosomal region whose 
nearest marker is within the k best scoring markers, where k is selected such that the 

10 resulting area has length at most t times the length of the studied region, and where 
k is maximal such value, or 

- those points in the studied chromosomal region whose nearest marker scores at 
least y or at most y, where y is scoring function dependent and is selected so that the 
probability of the gene being close to the marker is sufficiently large. 

15 The location of the gene may also be determined by expert investigation of the 
marker scores or their visualization e.g. as a curve. 

Permutation Tests 

More information about the significance of the observed scores may be obtained by 
permutation tests. The results obtained by considering the marker frequencies or the 

20 linear model, as explained earlier, can be contrasted against the null hypothesis that 
all the persons are drawn from the same distribution; that is, there is no gene effect 
in the disease status. We propose to permute randomly the status fields of the per- 
sons keeping the proportions of affected and control persons constant, m a fashion 
similar to the method of Churchill and Doerge (1994). We approximate marker- 

25 wise p values using permutations and then predict the DS gene to be in the vicinity 
of the marker with the smallest empirical p value. Consecutive markers are depend- 
ent and thus a large number of mutually dependent p values are produced. This is 
not a problem, since we do not use the p values for hypothesis testing, but only for 
ranking markers. 

30 Marker-wise p values are used to re-score markers by their statistical unexpected- 
ness The test is carried out as follows: The phenotypes of the persons are randomly 
shuffled a number (thousands) of times. The scores are re-calculated for each per- 
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mutation in turn. Marker-wise p value p(m0 is the proportion of such permutation 
scores for marker m* that are larger than or equal to the non-permuted score. 

Each score s(m0 is then refined by replacing it by the marker-wise p value p(mO of 
the score s(m0. 

5 Searching several genes 

Several genes may be searched for simultaneously by using marker patterns that 
refer to several potential gene loci at the same time. 

Examples 

Certain embodiments and results of the present invention are described in the fol- 
io lowing non-limiting examples. 

Example 1 - Simulated Data Sets 

We evaluated the performance of the proposed HPM-G method with simulated data 
sets that correspond to a recently founded, relatively isolated founder ^popula- 
tion Simulation of a population isolate was chosen, since it is recommended as the 
15 study population for LD studies. However, the method can be applied to any popu- 
lation that is suitable for LD analysis, since no assumptions are made about the 
population structure. 

An isolated founder population, which grows from the initial size of 200 to 100,000 
individuals in 20 generations, was simulated. 

20 The population pedigree was first generated assuming distinct generations and ex- 
ponential growth of the population size. In each generation, the parents of the new- 
born individuals were randomly selected from members of the previous generation, 
with the exception that whenever a parent with at least one child was chosen, his/her 
spouse was always forced to become the other parent of the child. This procedure 

25 generates family structure into each generation. 

In the simulation of inheritance, each member of the first generation was assigned 
to have one pair of homologous chromosomes. The genetic length of the chromo- 
somes was 100 cM for both males and females. Meiosis was repeatedly simulated, 
and in each meiosis the number of crossover points was taken from a Poisson distn- 
30 bution with parameter value 1 , which corresponds to the total genetic length of the 
chromosome. No chiasm interference was modeled. To accommodate the fact that 
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ever-increasing informativeness of marker maps may soon facilitate whole-genome 
^ mapping, we used relatively dense and informative marker maps with m er- 
rlZ LLls of exactly one cM. Each marker contained 4 alleles whose fre- 
lencL t the founder population were 0.4 for one allele and 0.2 for the remammg 
; q Z 7Z. The polymorphism information content (PIC) of each marker was thus 

fixed at 0.678. 

To produce each of the 100 data sets for HPM-G and HPM, .he processes of disease 
locus selection, diagnosing and sampling were done independendy. Next, these 
processes are described. 
0 For each data set, a random locus was selected as a disease locus and 8 random 
ermines present in me original population were labeled - <£"~™ 
chromosomes. In the final population, all chromosomes that had mhented the drs 
ease locus identical-by-descent from one of tire eight founders were constdeted to 
carry a disease-causing mutation. 

L3 In the diagnosing phase, we used a liability-based model, where an ****** 
probability of becoming affected depends on two factors: the presence of a d.sease 
£££ Ld a normally distributed random component. The random cmnponenus 
.hough, to contain factors such as environmental effects and effect 
taown genes. The liability of an individual is defined as L = 5,« !+ C, where 
20 indicator variable *, indicates the presence of any of the disease-causing mutations, 
"ab.e x 2 is randomly sampled from standard norma. « 
the generated segment data, the vatae of constant C is set m such a way mat * > * 
Tjed population preva.ence of 5 per cent is reached. The liability value L defines 
me probability of an individua! to be affected, denoted by p, v,a formula 
25 log _e_ = L . Having discovered the affection status of each individual, the destred 
number of individuals labeled as affected was randomly selected to constitute the 
affected sample. 

The con.ro! samples were created using two different methods: for HPM-G that util- 
es genotype data, me control individuals were simply taken randomly from the 
30 entire population. To do mis, the sampling process described above was repeated, 
Tut mis time tire liability of each individual was purely random, mcludmg no ge- 
netic component. 
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For the original HPM that requires haplotype data, we used a more laborious sam- 
pling method: the genotypes of the parents of the affected individuals were col- 
lected to create family-based pseudocontrol chromosomes. This was done in prac- 
tice by taking the alleles in the non-transmitted chromosomal segments of the par- 
5 ents of each affected individual and labeling them as control chromosomes. In real- 
ity, this is a common practice. In the simulations we treated the haplotypes obtained 
from the simulator as given, which corresponds to error-free haplotyping, and is 
expected to slightly favor HPM in the comparisons. 

The simulation of missing data was based on the notion that in real genotyping 
10 laboratories there seems to be two types of clustering of missing data. First, missing 
genotypes tend to cluster to certain individuals, which can be a consequence of low 
quality samples. Second, certain markers may function poorly, likely producing 
missing genotypes. To mimic such clustering of missing data, we defined two pa- 
rameters: parameter a corresponds to the amount of missing data that clusters to 
15 individuals and parameter P to the amount that clusters to markers. The missing 
genotypes were selected using the following procedure: 

For each individual i, a personal missing genotype probability x\ was computed 
as the x value of the first random point in (*, y) plane (x, ye [0,1]) that satisfies the 
inequality y<l/e m . Having computed the value of variable x\ for the individual, 
20 each of his/her genotypes was then labeled as missing with probability x\ . In the 
second phase, the procedure was repeated for each marker. For each marker j, a 
marker failure probability x? was computed in an analogous fashion as the x value 
of the first random point in (*, y) plane (*, ye [0,1]) that satisfies the inequality 
y < Ve* , and each genotype corresponding to that marker was labeled as missing 

* Ml 

25 independently for each individual with probability x s . 

Values of variables a and /? were empirically adjusted to produce the desired over- 
all levels of missing data. These values were: 25 and 80 for 5%, and 13 and 40 for 
10% of missing data. 

Example 2 - Comparison to HPM 

30 The localization accuracy was explored by plotting curves similar to power graphs: 
the height of the curve shows the fraction of data sets for which the localization was 
successful, as a function of the length of the predicted region. The sample consisted 
of 150 affected and 150 control genotypes. The maximum length of a pattern was 7, 
and one gap of one marker was allowed. The association threshold was set to 10. 
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These numbers were based on experimentation. For comparison, we also show the 
corresponding curve for HPM with 1/3 smaller sample size, and thus equal genotyp- 
ing cost (figure 1). With HPM we used association threshold 9, the parameters for 
the patterns were the same than those used with HPM-G. 

5 The results show that HPM-G has a high accuracy, and that it is extremely competi- 
tive even in comparison to state-of-the-art methods that use explicitly haplotyped 
data. 

Example 3 - effect of sample size 

The effect of sample size was examined by experimenting with sample sizes of 
10 100+100, 150+150, 200+200 and 300+300 people (figure 2a). Figure 2b shows the 
corresponding results for HPM. 

HPM-G performs well even with only 100+100 genotypes. On the other hand, if 
the amount of data is increased, the accuracy is improved. 

Example 4 - effect of missing data 

15 The influence of missing data was explored by randomly removing 5% or 10% of 
marker genotypes (figure 3a). Figure 3b shows the corresponding results for HPM. 

These results show that HPM-G is very robust against missing data. 

Example 5 - Localization Accuracy with Permutation Tests 

Permutation tests were used to obtain more information about the significance of 
20 observed marker frequencies. Marker-wise P values were used to sort markers by 
their statistical unexpectedness, not to test the statistical significance of the findings. 
We performed the following experiment in order to see if the prediction accuracy 
can be improved by permutation tests. We predicted the location of the DS gene to 
be at the marker with the smallest P value instead of the most frequent marker. The 
25 localization accuracy with 100 permutations compared to that without permutations 
is shown in figure 4. The curves are almost identical, which is due to the evenly dis- 
tributed and identically informative markers. 

The situation could be different with real marker data, where permutation tests are 
likely to bring a greater benefit. 
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